In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, tries to ''identify'' objects of a specific class amongst all objects, by primarily learning from a training set containing only the objects of that class, although there exist variants of one-class classifiers where counter-examples are used to further refine the classification boundary. This is different from and more difficult than the traditional classification problem, which tries to ''distinguish between'' two or more classes with the training set containing objects from all the classes. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or classifying the operational status of a nuclear plant as 'normal': in this scenario, there are few, if any, examples of catastrophic system states; only the statistics of normal operation are known. While many approaches focus on the case of removing a small number of outliers or anomalies, one can also learn the other extreme, where the single class covers a small coherent subset of the data, using an information bottleneck approach.

Overview

The term one-class classification (OCC) was coined by Moya & Hush (1996), and many applications can be found in the scientific literature, for example outlier detection, anomaly detection, and novelty detection. A feature of OCC is that it uses only sample points from the assigned class, so that a representative sampling is not strictly required for non-target classes.

Introduction

SVM-based one-class classification (OCC) relies on identifying the smallest hypersphere (with radius r and center c) containing all the data points. This method is called Support Vector Data Description (SVDD). Formally, the problem can be defined in the following constrained optimization form,

\min_{r,c} r^2 \text{ subject to } \|\Phi(x_i) - c\|^2 \le r^2 \;\; \forall i = 1, 2, ..., n

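However, the above formulation is highly restrictive and is sensitive to the presence of outliers. Therefore, a more flexible formulation, which allows for the presence of outliers through slack variables \zeta_i, is formulated as shown below,

\min_{r,c,\zeta} r^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\zeta_i

\text{subject to } \|\Phi(x_i) - c\|^2 \le r^2 + \zeta_i, \;\; \zeta_i \ge 0 \;\; \forall i = 1, 2, ..., n

Here the parameter \nu \in (0, 1] controls the trade-off between the size of the hypersphere and the total slack.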

From the Karush–Kuhn–Tucker (KKT) conditions for optimality, we get

c = \sum_{i=1}^{n}\alpha_i\Phi(x_i),

where the \alpha_i's are the solution to the following optimization problem:

\max_\alpha \sum_{i=1}^{n}\alpha_i\kappa(x_i, x_i) - \sum_{i,j=1}^{n}\alpha_i\alpha_j\kappa(x_i, x_j)

subject to,

\sum_{i=1}^{n}\alpha_i = 1 \text{ and } 0 \le \alpha_i \le \frac{1}{\nu n} \text{ for all } i = 1,2,...,n.

The introduction of a kernel function \kappa, which replaces inner products between the mapped points \Phi(x_i), provides additional flexibility to the one-class SVM (OSVM) algorithm.
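As an illustration of how the kernel enters the decision rule, the squared distance from a test point z to the center can be expanded as \kappa(z, z) - 2\sum_i \alpha_i\kappa(x_i, z) + \sum_{i,j}\alpha_i\alpha_j\kappa(x_i, x_j). The following is a minimal sketch, not a full SVDD solver. It assumes the upper bound on each \alpha_i is 1/(\nu n) and picks \nu = 1, in which case \alpha_i = 1/n is the only feasible point and no optimizer is needed. The RBF kernel, its gamma value, the toy data, and the choice of radius are all illustrative assumptions.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; gamma = 0.5 is an arbitrary illustrative value."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq)

def squared_distance_to_center(z, X, alpha, kernel=rbf_kernel):
    """||Phi(z) - c||^2 with c = sum_i alpha_i Phi(x_i), written with kernels only."""
    term_zz = kernel(z, z)
    term_zx = sum(a * kernel(x, z) for a, x in zip(alpha, X))
    term_xx = sum(ai * aj * kernel(xi, xj)
                  for ai, xi in zip(alpha, X)
                  for aj, xj in zip(alpha, X))
    return term_zz - 2.0 * term_zx + term_xx

# Toy target-class training set (made-up values).
X = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.3), (0.1, 0.1), (-0.2, -0.2)]
n = len(X)

# Assumption: nu = 1, so sum(alpha) = 1 and 0 <= alpha_i <= 1/n force alpha_i = 1/n.
alpha = [1.0 / n] * n

# Assumption: take the largest training distance as the squared radius r^2.
r2 = max(squared_distance_to_center(x, X, alpha) for x in X)

def is_target(z):
    """Accept z when it lies inside the hypersphere in feature space."""
    return squared_distance_to_center(z, X, alpha) <= r2

print(is_target((0.05, 0.0)))  # a point close to the training data
print(is_target((3.0, 3.0)))   # a point far from the training data
```

For \nu < 1 the \alpha_i would have to come from a quadratic-programming solver; only the distance computation above would stay the same.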

PU (Positive Unlabeled) learning

A similar problem is PU learning, in which a binary classifier is constructed by semi-supervised learning from only ''positive'' and ''unlabeled'' sample points. In PU learning, two sets of examples are assumed to be available for training: the positive set P and a ''mixed set'' U, which is assumed to contain both positive and negative samples, but without these being labeled as such. This contrasts with other forms of semi-supervised learning, where it is assumed that a labeled set containing examples of both classes is available in addition to unlabeled samples. A variety of techniques exist to adapt supervised classifiers to the PU learning setting, including variants of the EM algorithm. PU learning has been successfully applied to text, time series, bioinformatics tasks, and remote sensing data.

Approaches

Several approaches have been proposed to solve one-class classification (OCC). They fall into three main categories: density estimation, boundary methods, and reconstruction methods.

Density estimation methods

Density estimation methods rely on estimating the density of the data points and setting a threshold on it. These methods rely on assuming a distribution, such as a Gaussian or a Poisson distribution, after which discordancy tests can be used to test new objects. These methods are robust to scale variance.

The Gaussian model is one of the simplest methods to create one-class classifiers. Due to the central limit theorem (CLT), these methods work best when a large number of samples is present and the samples are perturbed by small independent error values. The probability distribution for a d-dimensional object is given by:

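p_{\mathcal{N}}(x;\mu;\Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}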

where \mu is the mean and \Sigma is the covariance matrix. Computing the inverse of the covariance matrix (\Sigma^{-1}) is the costliest operation, and in cases where the data is not scaled properly or has singular directions, the pseudo-inverse \Sigma^+ is used to approximate the inverse; it is calculated as \Sigma^T(\Sigma \Sigma^T)^{-1}.
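The Gaussian model can be sketched for the special case d = 2, where the inverse and determinant of the covariance matrix have closed forms. This is only an illustration under stated assumptions: the helper names, the toy data, and the rule of using the lowest training density as the acceptance threshold are not taken from the source.

```python
import math

def fit_gaussian_2d(X):
    """Estimate the mean vector and the 2x2 sample covariance of 2-D points."""
    n = len(X)
    mx = sum(p[0] for p in X) / n
    my = sum(p[1] for p in X) / n
    sxx = sum((p[0] - mx) ** 2 for p in X) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in X) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in X) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def gaussian_density_2d(x, mu, sigma):
    """Evaluate the Gaussian density for d = 2 using the closed-form 2x2 inverse."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    if det <= 0.0:
        # Singular directions: a pseudo-inverse would be needed instead.
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T Sigma^{-1} (x - mu), with Sigma^{-1} = [[s22, -s12], [-s21, s11]] / det
    maha = (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det
    dim = 2
    norm = 1.0 / ((2.0 * math.pi) ** (dim / 2) * math.sqrt(det))
    return norm * math.exp(-0.5 * maha)

# Toy target-class data (made-up values).
X = [(1.0, 2.0), (1.2, 1.9), (0.8, 2.2), (1.1, 2.1), (0.9, 1.8), (1.0, 2.0)]
mu, sigma = fit_gaussian_2d(X)

# Assumption: accept anything at least as dense as the least likely training point.
threshold = min(gaussian_density_2d(x, mu, sigma) for x in X)

def is_target(z):
    return gaussian_density_2d(z, mu, sigma) >= threshold

print(is_target((1.0, 2.0)))  # the training mean
print(is_target((3.0, 0.0)))  # far from the training data
```

In practice the threshold would more likely be set from a chosen quantile of the training densities, so that a small fraction of target objects is rejected on purpose.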

Boundary methods

Boundary methods focus on setting boundaries around a small set of points, called target points. These methods attempt to minimize the enclosed volume. Boundary methods rely on distances, and hence are not robust to scale variance. The K-centers method, NN-d, and SVDD are some of the key examples.

K-centers

In the K-center algorithm, k small balls with equal radius are placed to minimize the maximum distance of all minimum distances between training objects and the centers. Formally, the following error is minimized,

\varepsilon_{k\text{-center}} = \max_i ( \min_k \| x_i - \mu_k \|^2 )

The algorithm uses a forward search method with random initialization, where the radius is determined by the maximum distance of the objects that any given ball should capture. After the centers are determined, for any given test object z the distance can be calculated as,

d_{k\text{-center}}(z) = \min_k \| z - \mu_k \|^2
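The error and distance above can be sketched in a few lines. Note the simplification: instead of a forward search, this sketch draws k training objects at random as candidate centers over several trials and keeps the set with the lowest error. The function names, the number of trials, the seed, and the toy data are assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def min_sq_dist_to_centers(z, centers):
    """Distance of z to the model: min over k of ||z - mu_k||^2."""
    return min(sq_dist(z, mu) for mu in centers)

def k_center_error(X, centers):
    """Error being minimized: max over i of min over k of ||x_i - mu_k||^2."""
    return max(min_sq_dist_to_centers(x, centers) for x in X)

def fit_k_centers(X, k, trials=200, seed=0):
    """Random-restart stand-in for the search: keep the best of `trials` random center sets."""
    rng = random.Random(seed)
    best_centers, best_err = None, float("inf")
    for _ in range(trials):
        centers = rng.sample(X, k)  # centers are chosen among the training objects
        err = k_center_error(X, centers)
        if err < best_err:
            best_centers, best_err = centers, err
    return best_centers, best_err

# Toy target-class data forming two tight groups (made-up values).
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, err = fit_k_centers(X, k=2)

def is_target(z):
    """Accept z when it falls inside one of the equal-radius balls."""
    return min_sq_dist_to_centers(z, centers) <= err

print(is_target((0.05, 0.05)))  # inside the first group
print(is_target((2.5, 2.5)))    # between the groups, outside both balls
```

Because the squared radius equals the fitted error, every training object is accepted by construction.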

Reconstruction methods

Reconstruction methods use prior knowledge and the generating process to build a generating model that best fits the data. New objects can be described in terms of a state of the generating model. Some examples of reconstruction methods for OCC are k-means clustering, learning vector quantization, and self-organizing maps.
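To make the idea concrete for k-means, one possible reading is that the cluster prototypes act as the generating model and a new object is scored by its reconstruction error, the squared distance to the nearest prototype. The sketch below uses plain Lloyd iterations. The initialization, iteration cap, toy data, and threshold rule are illustrative assumptions rather than a prescribed procedure.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the list of k prototype vectors."""
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(X, k)]
    for _ in range(iters):
        # Assignment step: attach every object to its nearest prototype.
        clusters = [[] for _ in range(k)]
        for x in X:
            j = min(range(k), key=lambda idx: sq_dist(x, centers[idx]))
            clusters[j].append(x)
        # Update step: move each prototype to the mean of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(m[t] for m in members) / len(members) for t in range(dim)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's prototype
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def reconstruction_error(z, centers):
    """Squared distance from z to its nearest prototype."""
    return min(sq_dist(z, c) for c in centers)

# Toy target-class data (made-up values).
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers = k_means(X, k=2)

# Assumption: the largest training reconstruction error serves as the threshold.
threshold = max(reconstruction_error(x, centers) for x in X)

def is_target(z):
    return reconstruction_error(z, centers) <= threshold

print(is_target((5.05, 5.05)))  # close to a prototype
print(is_target((2.5, 2.5)))    # poorly reconstructed by either prototype
```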

Applications

Document classification

The basic Support Vector Machine (SVM) paradigm is trained using both positive and negative examples; however, studies have shown there are many valid reasons for using ''only'' positive examples. When the SVM algorithm is modified to use only positive examples, the process is considered one-class classification. One situation where this type of classification might prove useful is in trying to identify a web browser user's sites of interest based only on the user's browsing history.

Biomedical studies

One-class classification can be particularly useful in biomedical studies, where data from other classes can often be difficult or impossible to obtain. In studying biomedical data it can be difficult and/or expensive to obtain the set of labeled data from the second class that would be necessary to perform a two-class classification. A study from The Scientific World Journal found that the typicality approach is the most useful in analysing biomedical data because it can be applied to any type of dataset (continuous, discrete, or nominal). The typicality approach is based on the clustering of data by examining data and placing it into new or existing clusters. To apply typicality to one-class classification for biomedical studies, each new observation, y_0, is compared to the target class, C, and identified as an outlier or a member of the target class.

Unsupervised concept drift detection

One-class classification has similarities with unsupervised concept drift detection: both aim to identify whether unseen data share similar characteristics with the initial data. A concept refers to the fixed probability distribution from which data is drawn. In unsupervised concept drift detection, the goal is to detect whether the data distribution changes, without utilizing class labels. In one-class classification, the flow of data is not important: unseen data is classified as typical or outlier depending on its characteristics, that is, whether or not it comes from the initial concept. Unsupervised drift detection, however, monitors the flow of data and signals a drift if there is a significant amount of change or anomalies. Unsupervised concept drift detection can thus be seen as the continuous form of one-class classification, and one-class classifiers are used for detecting concept drifts.
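The relationship can be illustrated with a toy detector that applies a one-class rule to each incoming value and signals drift once enough recent values are flagged as outliers. This is a hypothetical sketch, not a published drift detector: the three-standard-deviation rule, the window length, the outlier count needed to signal, and the data are all assumptions.

```python
import statistics
from collections import deque

def fit_one_class(reference, z_max=3.0):
    """Fit a simple 1-D one-class rule on the initial concept (mean +/- z_max * std)."""
    mu = statistics.mean(reference)
    sd = statistics.stdev(reference)

    def is_outlier(x):
        return abs(x - mu) > z_max * sd

    return is_outlier

def detect_drift(stream, is_outlier, window=5, min_outliers=4):
    """Return the index at which drift is first signalled, or None if it never is."""
    recent = deque(maxlen=window)  # outlier flags of the most recent values
    for i, x in enumerate(stream):
        recent.append(is_outlier(x))
        if len(recent) == window and sum(recent) >= min_outliers:
            return i
    return None

# Initial concept and a stream whose level shifts partway through (made-up values).
reference = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
stream = [10.1, 9.9, 14.0, 10.0, 10.2, 13.8, 14.1, 13.9, 14.2, 14.0]

is_outlier = fit_one_class(reference)
print(detect_drift(stream, is_outlier))  # → 8
```

The isolated outlier at index 2 is classified as atypical but does not trigger a drift signal, which reflects the distinction drawn above between classifying individual objects and monitoring the flow of data.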
